Machine Learning: Mathematical Theory and Applications

Computer lab 3

Hugo Xinghe CHEN

Published

October 14, 2024

Warning

The instructions in this lab assume that the following packages are installed:

  • keras (note: may require some effort to install, see below)

  • dplyr

  • caret

  • pROC

  • pracma

  • splines

  • grid

  • mvtnorm

Packages may be installed by executing install.packages('packagename'), where 'packagename' is the name of the package, e.g. 'splines'. You may have to install additional packages to solve the computer lab. If you want this file to run locally on your computer, you have to change some paths to where the data are stored (see below).

Introduction

This computer lab treats topics such as dense neural networks, convolutional neural networks, and Gaussian processes.


Instructions

In this computer lab, you will work individually but you are free to discuss with other students, especially the one student you worked with on Computer lab 2 (and will work with on the upcoming Computer lab 4). However, you have to hand in your own set of solutions. It is strictly forbidden to copy each others codes or solutions, as well as the use of AI tools (such as ChatGPT). Not obeying these instructions is regarded as academic misconduct and may have serious consequences for you.

This computer lab is worth a total of 15 marks (the subject has a maximum of 100 marks). The maximum marks for a given problem is shown within parenthesis at the top of the section. The problems you should solve are marked with the symbol 💪 and surrounded by a box. Hand in the solutions and programming codes in the form of a html document generated by Quarto. Before submitting, carefully check that the document compiles without any errors. Use properly formatted figures and programming code.

Warning

Not submitting according to the instructions may result in loss of marks. The same applies to poorly formatted reports (including poorly formatted code/figures, unnecessarily long outputs, etc.).

1. Deep learning for spam email data (classification) (2 marks)

Recall the spam_ham_emails.RData dataset from Computer labs 1 and 2. The dataset consists of \(n=4601\) spam (\(y=1\)) and ham (no spam, \(y=0\)) emails with corresponding 15 features and can be downloaded from the Canvas page of the course1.

Most of the features are continuous real variables in the interval \([0, 100]\), with values corresponding to the percentage of occurrence of a specific word or character in the email. There are also a few features that capture the tendency to use many capital letters. The following code standardises the features and creates a partition with 75% of the data for training and 25% for testing. Some important comments follow after the code.

rm(list=ls()) # Remove variables 
cat("\014") # Clean workspace
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/3/spam_ham_emails.RData')
set.seed(12345)
suppressMessages(library(caret))
Spam_ham_emails[, -1] <- scale(Spam_ham_emails[, -1])
Spam_ham_emails[, 'spam'] <- as.integer(Spam_ham_emails[, 'spam'] == 1) 
head(Spam_ham_emails)
  spam       over     remove   internet        free        hpl          X.
1    0 -0.3502281 -0.2917622 -0.2625330 -0.30134485  4.2669752 -0.32987644
2    1  1.6583607 -0.2917622  0.4106637  1.03071019 -0.2992074  0.38364598
3    0 -0.3502281 -0.2917622 -0.2625330 -0.30134485  2.6659926 -0.32987644
4    1  0.6358064 -0.2917622 -0.2625330 -0.30134485 -0.2992074 -0.28451505
5    1  0.3436480  0.1936234 -0.2625330 -0.07126262 -0.2992074 -0.06996793
6    1 -0.3502281  0.9855684  0.9841276 -0.30134485 -0.2992074  0.17522878
        X..1   CapRunMax CapRunTotal        our         hp   george      X1999
1 -0.3083214 -0.17021174  0.02096274 -0.4642639  3.9791176 -0.22787  4.9900587
2  0.8751730 -0.08811470  0.01271665 -0.4642639 -0.3287789 -0.22787 -0.3234205
3 -0.3083214 -0.22665345 -0.40618481 -0.4642639 -0.3287789 -0.22787  2.7702052
4 -0.3083214 -0.09837683 -0.12416847 -0.4642639 -0.3287789 -0.22787 -0.3234205
5  0.9239769  0.40959862  0.75651413 -0.1817414 -0.3287789 -0.22787 -0.3234205
6  1.3672790 -0.17021174 -0.05655052  0.6658261 -0.3287789 -0.22787 -0.3234205
           re       edu
1  0.59185916 -0.197366
2 -0.03086294 -0.197366
3 -0.29774385 -0.197366
4 -0.29774385 -0.197366
5 -0.29774385 -0.197366
6 -0.29774385 -0.197366
train_obs <- createDataPartition(y = Spam_ham_emails$spam, p = .75, list = FALSE)
train <- as.matrix(Spam_ham_emails[train_obs, ])
y_train <- train[, 1]
X_train <- train[, -1]
test <- as.matrix(Spam_ham_emails[-train_obs, ])
y_test <- test[, 1]
X_test <- test[, -1]

# Confirm both training and test are balanced with respect to spam emails
cat("Percentage of training data consisting of spam emails:", 100*mean(train[, 1] == 1))
Percentage of training data consisting of spam emails: 39.46682
cat("Percentage of test data consisting of spam emails:", 100*mean(test[, 1] == 1))
Percentage of test data consisting of spam emails: 39.21739

Note that in the code above, we are not defining the response variable as a factor variable like we did in Computer labs 1 and 2, because the keras package that we will use does not recognise factor variables (the package originated in Python). For the same reason, it does not recognise the data frame structure in R, and therefore we use the as.matrix() function to convert it to a matrix object.

Be aware!

Not ensuring that the inputs to functions in the keras package have the correct format will cause painful and time consuming bugs.

In this problem we use neural networks, also known as deep learning, to classify spam emails via the keras package. Keras is an open-source deep learning framework originally developed in Python but extended to work with R. The package allows you to build, fit, and evaluate deep learning models with a variety of networks.

It requires some effort to install keras in R, particularly on Windows. An alternative Python notebook is provided if you cannot manage to install keras in R (much easier to install in Python). However, I do encourage you to give it a try to install it in R and here follow some (hopefully!) useful instructions. You will need programs that allow compiling source code. On Windows, RTools allows compilation of source code. The corresponding program on Mac is Xcode; see this page for some helpful instructions. After installation of these programs, you have to install the tensorflow package and make sure Python is installed on your system; see this page for some helpful instructions. For Windows, another suggestion is to install anaconda3 and then install a new version of Rstudio through the conda environment (using the Anaconda navigator); see this page for more information page. Another tip that may be useful can be found on this page.

If none of the instructions above work for you, try to Google for an alternative solution. If you still cannot manage to install the keras package in R, I recommend that you do the deep learning part of the lab in Python. A separate Python notebook is available on Canvas for this purpose.

The whole process above uses operations that require administrative privileges on the computer, thus making the keras package not readily available on the work stations in the computer lab room. You should therefore use your own laptop to solve problems when working with the keras package. If you do not have access to a laptop, or a private work station with administrative rights, let me know ASAP.

The following code builds a simple two layer dense neural network. Some explanation of the code follows.

suppressMessages(library(tensorflow))
suppressMessages(library(keras))
#install_tensorflow()
#install_keras()
tensorflow::tf$random$set_seed(12345)
model <- keras_model_sequential() 
model %>% 
  # Add first hidden layer
  layer_dense(units = 12, activation = 'relu', input_shape = c(15)) %>% 
  # Add regularisation via dropout to the first hidden layer
  layer_dropout(rate = 0.3) %>% 
  # Add second hidden layer
  layer_dense(units = 6, activation = 'relu') %>%
  # Add regularisation via dropout to the second hidden layer
  layer_dropout(rate = 0.3) %>%
  # Add layer that connects to the observations
  layer_dense(units = 1, activation = 'sigmoid')
summary(model)
Model: "sequential"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_2 (Dense)                    (None, 12)                      192         
 dropout_1 (Dropout)                (None, 12)                      0           
 dense_1 (Dense)                    (None, 6)                       78          
 dropout (Dropout)                  (None, 6)                       0           
 dense (Dense)                      (None, 1)                       7           
================================================================================
Total params: 277 (1.08 KB)
Trainable params: 277 (1.08 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________

The tensorflow package is needed to set the seed for reproducibility on the third line (the usual set.seed() call does not work in keras). The %>% operator is called the pipe operator. It is a useful operator that allows you to chain together multiple calls in a concise and readable way. The pipe operator takes the output of the left-hand side expression and passes it as the first argument to the right-hand side expression.

The keras_model_sequential() function initialises an empty model that we can stack layers on. The first (from the left) call layer_dense() stacks the first dense (i.e. fully connected) layer with 12 units (specified by the argument units) to the model. Since this is the first layer, the dimension of the input has to be specified via the argument input. Note that our feature matrix does not contain the intercept and thus has dimension 15. Importantly, layer_dense() includes an intercept by default via the argument use_bias=TRUE (recall that bias means intercept in the deep learning world). It is therefore important to always omit the intercept term in the feature matrix (unless use_bias=FALSE is set in the first layer, however, this is unusual to do). The activation argument to the layer_dense() function specifies the activation function for the corresponding layer. Next, the call layer_dropout() adds regularisation (to prevent overfitting) via dropout to the preceding layer, where the rate argument specifies the fraction of neurons (units) to randomly dropout (set to zero) during training in the corresponding layer. Finally, the summary() function summarises the object.

The model above is said to have two (hidden) layers; the last layer only connects to the output. Following the notation in Lindholm et al. (2022)2, the model specified above has the structure

\[\begin{align*} \boldsymbol{q}^{(1)} & =h\left(\boldsymbol{W}^{(1)}\mathbf{x}+\boldsymbol{b}^{(1)}\right)\\ \boldsymbol{q}^{(2)} & =h\left(\boldsymbol{W}^{(2)}\boldsymbol{q}^{(1)}+\boldsymbol{b}^{(2)}\right)\\ \Pr(y =1|\mathbf{x}) & = g\left(\boldsymbol{W}^{(3)}\boldsymbol{q}^{(2)}+b^{(3)}\right), \end{align*}\]

with activation functions ReLU \(h\) and sigmoid \(g\).

💪 Problem 1.1

What are the dimensions of \(\boldsymbol{q}^{(1)},\boldsymbol{q}^{(2)}, \boldsymbol{b}^{(1)},\boldsymbol{b}^{(2)}, b^{(3)}, \boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)}\), and \(\Pr(y =1|\mathbf{x})\)?

Dimensions: q1: (12, 1) q2: (6, 1) b1: (12, 1) b2: (6, 1) b3: (1, 1) W1: (12, 15) W3: (1, 6) Pr(y=1|x): (1, 1)

💪 Problem 1.2

What is the number of parameters for each of the three equations above (\(\mathbf{x}\) and \(\boldsymbol{q}\) are not parameters)? Verify that this agrees with the output of summary(model) above.

The equations show the 3 hidden layers. The feature matrix has 15 features and the first layer has 12 units. Number of weights: 15 * 12 = 180 There is one bias for each of the 12 units in the first hidden layer. Number of biases: 12 Number of parameters for the first equation/layer: 180 + 12 = 192 The first hidden layer has 12 units, and the second hidden layer has 6 units. Number of weights: 12 * 6 = 72 Number of biases: 6 Number of parameters for the second equation/layer: 72 + 6 = 78 The second hidden layer has 6 units, and the output layer has 1 unit. Number of weights: 6 * 1 = 6 Number of biases: 1 Number of parameters for the second equation/layer: 6 + 1 = 7 Total number of parameters in the model: 192 + 78 + 7 = 277 (It correspond to the output of summary(model) perfectly)

We now compile the model.

model %>% compile(loss = 'binary_crossentropy', optimizer = 'SGD', metrics = c('accuracy', 'AUC'))

The argument loss specifies the objective function to minimise when learning the algorithm. Since this a binary classification problem we use the binary crossentropy (recall that crossentropy is related to the likelihood). If we instead deal with a multi-class classification problem, the categorical crossentropy is an option, specified as loss='categorical_crossentropy' . For a regression problem, on the other hand, loss='mse' (mean squared error) is appropriate. The argument optimizer='SGD' uses the optimiser stochastic gradient descent (covered in Computer lab 2). More advanced optimisers such as 'RMSprop' or 'adam' can be used. Finally, the argument metrics specifies the metrics used to evaluate the performance of the model. The metrics are calculated after each epoch of training, and thus track the progress of the model fitting.

The following code fits the model running the stochastic optimiser for 200 epochs with a mini-batch size of 50. The validation_split argument specifies the ratio of the training data used to compute the validation metrics.

model_fit <- model %>% fit(X_train, y_train, epochs = 200, batch_size = 50, validation_split = 0.2, verbose = 0)
plot(model_fit)

Note the peculiar result that the model is performing better on the validation dataset than the training dataset. One explanation is that the real dataset included many more features, but I just kept 15 of them for simplicity. The dataset thus contain many duplicates and if the validation set is close to a subset of the training set, this result (i.e. performing better on the validation set) is not too surprising.

In the optimisation above, we note that the performance metrics, as well as the validation loss, do not improve much after some epochs. We could thus stop the optimisation prematurely if the loss, or the evaluation metrics, for the validation set does not improve. This is known as regularisation via early stopping. We first have to create an early stopping callback via the callback_early_stopping() function.

early_stopping <- callback_early_stopping(monitor="val_loss", patience = 10, restore_best_weights = TRUE)

The argument monitor specifies which quantity is monitored to determine if early stopping should occur. The argument patience=10 executes the early stopping if there is no improvement in the monitored quantity for 10 consecutive epochs. Finally, restore_best_weights=TRUE restores the weights (parameters) from the epoch with the best value of the monitored quantity (if FALSE the weights from the last step are used). We now show how the callback is used to allow for (possibly) early stopping. For illustration purposes, the code creates a new model instance, since otherwise the fit() function will resume the optimisation where the previous call to fit() left off. Moreover, the optimiser ADAM is used (in place of SGD).

model <- keras_model_sequential()
model %>%
  # Add first hidden layer
  layer_dense(units = 12, activation = 'relu', input_shape = c(15)) %>%
  # Add regularisation via dropout to the first hidden layer
  layer_dropout(rate = 0.3) %>%
  # Add second hidden layer
  layer_dense(units = 6, activation = 'relu') %>%
  # Add regularisation via dropout to the second hidden layer
  layer_dropout(rate = 0.3) %>%
  # Add layer that connects to the observations
  layer_dense(units = 1, activation = 'sigmoid')
summary(model)
Model: "sequential_1"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_5 (Dense)                    (None, 12)                      192         
 dropout_3 (Dropout)                (None, 12)                      0           
 dense_4 (Dense)                    (None, 6)                       78          
 dropout_2 (Dropout)                (None, 6)                       0           
 dense_3 (Dense)                    (None, 1)                       7           
================================================================================
Total params: 277 (1.08 KB)
Trainable params: 277 (1.08 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
# Compile model
model %>% compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = c('accuracy', 'AUC'))
# Fit model
model_fit <- model %>% fit(X_train, y_train, epochs = 200, batch_size = 50, validation_split = 0.2, callbacks = list(early_stopping), verbose = 0)
plot(model_fit)

print(length(model_fit$metrics$val_loss)) # how many epochs before early stopping
[1] 162

Another way of imposing model regularisation is via an explicit penalty, e.g. L1 or L2 penalty, on the model coefficients. The following code applies an L1 penalty on the first hidden layer of the model (in addition to the regularisation by dropout).

model_l1_reg <- keras_model_sequential()
model_l1_reg %>%
  # Add first hidden layer
  layer_dense(units = 12, activation = 'relu', input_shape = c(15), kernel_regularizer = regularizer_l1(l = 0.01)) %>%
  # Add regularisation via dropout to the first hidden layer
  layer_dropout(rate = 0.3) %>%
  # Add second hidden layer
  layer_dense(units = 6, activation = 'relu') %>%
  # Add regularisation via dropout to the second hidden layer
  layer_dropout(rate = 0.3) %>%
  # Add layer that connects to the observations
  layer_dense(units = 1, activation = 'sigmoid')
summary(model_l1_reg)
Model: "sequential_2"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_8 (Dense)                    (None, 12)                      192         
 dropout_5 (Dropout)                (None, 12)                      0           
 dense_7 (Dense)                    (None, 6)                       78          
 dropout_4 (Dropout)                (None, 6)                       0           
 dense_6 (Dense)                    (None, 1)                       7           
================================================================================
Total params: 277 (1.08 KB)
Trainable params: 277 (1.08 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
# Compile model
model_l1_reg %>% compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = c('accuracy', 'AUC'))
# Fit model
model_fit_l1_reg <- model_l1_reg %>% fit(X_train, y_train, epochs = 200, batch_size = 50, validation_split = 0.2, verbose = 0)
plot(model_fit_l1_reg)

The argument kernel_regularizer regularises the weights (kernel is synonym for weights in deep learning jargon). It is also possible to regularise the bias (the intercept) via the argument bias_regularizer, however, this is less common for obvious reasons. The regularizer_l1() function applies L1 regularisation and its argument l controls the amount of regularisation (a larger value regularises more, i.e. gives a less flexible model). Other regularisations can also be applied, such as L2 (e.g. regularizer_l2(l=0.01)) or the elastic net penalty (e.g. regularizer_l1_l2(l1=0.01, l2=0,01)). Finally, note that the other layers in the model can also be regularised.

Our final demonstration in this section concerns how to carry out model evaluation and predictions. The same concepts for evaluating classifiers will be used, e.g. the accuracy, confusion matrix, and ROC curve (among others). Note that the confusionMatrix() function requires factor variables as input, which is achieved by using the as.factor() function.

suppressMessages(library(pROC))
# Evaluating on training data
results_train_model <- model %>% evaluate(X_train, y_train)
108/108 - 0s - loss: 0.1855 - accuracy: 0.9368 - auc: 0.9768 - 97ms/epoch - 902us/step
print("2-Layer model")
[1] "2-Layer model"
print(results_train_model)
     loss  accuracy       auc 
0.1854810 0.9368299 0.9768201 
# Evaluating on test data
results_test_model <- model %>% evaluate(X_test, y_test)
36/36 - 0s - loss: 0.1967 - accuracy: 0.9270 - auc: 0.9715 - 48ms/epoch - 1ms/step
print(results_test_model)
     loss  accuracy       auc 
0.1966807 0.9269565 0.9714701 
# Prediction test data
y_prob_hat_test <- model %>% predict(X_test)
36/36 - 0s - 107ms/epoch - 3ms/step
threshold <- 0.5 # Predict spam if probability > threshold
y_hat_test <- as.factor(y_prob_hat_test > threshold)
levels(y_hat_test) <- c("not spam", "spam")
test_spam <- as.factor(test[, 1])
levels(test_spam) <- c("not spam", "spam")
confusionMatrix(data = y_hat_test, test_spam, positive = "spam")
Confusion Matrix and Statistics

          Reference
Prediction not spam spam
  not spam      663   48
  spam           36  403
                                          
               Accuracy : 0.927           
                 95% CI : (0.9104, 0.9413)
    No Information Rate : 0.6078          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8461          
                                          
 Mcnemar's Test P-Value : 0.2301          
                                          
            Sensitivity : 0.8936          
            Specificity : 0.9485          
         Pos Pred Value : 0.9180          
         Neg Pred Value : 0.9325          
             Prevalence : 0.3922          
         Detection Rate : 0.3504          
   Detection Prevalence : 0.3817          
      Balanced Accuracy : 0.9210          
                                          
       'Positive' Class : spam            
                                          
# ROC curve
roc_obj <- roc(response = test_spam, predictor = as.vector(y_prob_hat_test), print.auc = TRUE)
Setting levels: control = not spam, case = spam
Setting direction: controls < cases
plot(roc_obj, legacy.axes = TRUE, col = "cornflowerblue", main = "ROC spam email classifiers")

cat("AUC for deep learning 2 hidden layer model", roc_obj$auc)
AUC for deep learning 2 hidden layer model 0.9714353

💪 Problem 1.3

Fit a one layer dense neural network with 8 hidden units to the spam data using the ADAM optimiser. You can use the same settings as the previous problem, but feel free to experiment. How does this model compare to the two layer dense model above?

model_one_layer <- keras_model_sequential() %>%
  # Add first (and only) hidden layer
  layer_dense(units = 8, activation = 'relu', input_shape = c(15)) %>%
  # Add regularisation via dropout
  layer_dropout(rate = 0.3) %>%
  # Add layer that connects to the output
  layer_dense(units = 1, activation = 'sigmoid')

summary(model_one_layer)
Model: "sequential_3"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_10 (Dense)                   (None, 8)                       128         
 dropout_6 (Dropout)                (None, 8)                       0           
 dense_9 (Dense)                    (None, 1)                       9           
================================================================================
Total params: 137 (548.00 Byte)
Trainable params: 137 (548.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
# Compile the model using the ADAM optimizer
model_one_layer %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy', 'AUC')
)

# Set up early stopping
early_stopping <- callback_early_stopping(
  monitor = "val_loss", 
  patience = 10, 
  restore_best_weights = TRUE
)

# Fit the model on the training data
model_fit_one_layer <- model_one_layer %>% fit(
  X_train, y_train, 
  epochs = 200, 
  batch_size = 50, 
  validation_split = 0.2, 
  callbacks = list(early_stopping), 
  verbose = 0
)

plot(model_fit_one_layer)

results_train_one_layer <- model_one_layer %>% evaluate(X_train, y_train)
108/108 - 0s - loss: 0.2145 - accuracy: 0.9290 - auc: 0.9726 - 98ms/epoch - 911us/step
results_test_one_layer <- model_one_layer %>% evaluate(X_test, y_test)
36/36 - 0s - loss: 0.2140 - accuracy: 0.9304 - auc: 0.9684 - 46ms/epoch - 1ms/step
print("One-layer model")
[1] "One-layer model"
print(results_test_one_layer)
     loss  accuracy       auc 
0.2140170 0.9304348 0.9684091 
Here, we have AUC of 0.9679952 for the one layer dense model of 8 hidden units. And 0.9703948 for the two layer dense model (12 and 6 units). So we can say that the two layer dense model performs better than the one layer dense model in this case.

2. Deep learning for bike rental data (regression) (2 marks)

Recall the bike_rental_hourly.csv dataset from Computer labs 1 and 2. The dataset can be downloaded from the Canvas page of the course3.

The dataset contains 17379 hourly observations over the two-year period January 1, 2011 - December 31, 2012. The dataset contains many features that may be useful for predicting the number of rides, such as the hour of the day, temperature, humidity, season, weekday, etc. In this section, we use the dataset that in addition contains features that capture the time series characteristics of the data. The following code builds that dataset, where we use the same training and test data as in Problem 4.2 in Computer lab 1. A minor difference is that we are not including the intercept in the feature matrix, since we will use keras.

rm(list=ls()) # Remove variables
cat("\014") # Clean workspace
suppressMessages(library(dplyr))
suppressMessages(library(splines))
bike_data <- read.csv('/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/3/bike_rental_hourly.csv')
head(bike_data)
  instant     dteday season yr mnth hr holiday weekday workingday weathersit
1       1 2011-01-01      1  0    1  0       0       6          0          1
2       2 2011-01-01      1  0    1  1       0       6          0          1
3       3 2011-01-01      1  0    1  2       0       6          0          1
4       4 2011-01-01      1  0    1  3       0       6          0          1
5       5 2011-01-01      1  0    1  4       0       6          0          1
6       6 2011-01-01      1  0    1  5       0       6          0          2
  temp  atemp  hum windspeed casual registered cnt
1 0.24 0.2879 0.81    0.0000      3         13  16
2 0.22 0.2727 0.80    0.0000      8         32  40
3 0.22 0.2727 0.80    0.0000      5         27  32
4 0.24 0.2879 0.75    0.0000      3         10  13
5 0.24 0.2879 0.75    0.0000      0          1   1
6 0.24 0.2576 0.75    0.0896      0          1   1
bike_data$log_cnt <- log(bike_data$cnt)
bike_data$hour <- bike_data$hr/23 # transform [0, 23] to [0, 1]. 0 is midnight, 1 is 11 PM

# One hot for weathersit
one_hot_encode_weathersit <- model.matrix(~ as.factor(weathersit) - 1,data = bike_data)
one_hot_encode_weathersit  <- one_hot_encode_weathersit[, -1] # Remove reference category
colnames(one_hot_encode_weathersit) <- c('cloudy', 'light rain', 'heavy rain')
bike_data <- cbind(bike_data, one_hot_encode_weathersit)

# One hot for weekday
one_hot_encode_weekday <- model.matrix(~ as.factor(weekday) - 1,data = bike_data)
one_hot_encode_weekday  <- one_hot_encode_weekday[, -1] # Remove reference category
colnames(one_hot_encode_weekday) <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
bike_data <- cbind(bike_data, one_hot_encode_weekday)

# One hot for weekday
one_hot_encode_season <- model.matrix(~ as.factor(season) - 1,data = bike_data)
one_hot_encode_season  <- one_hot_encode_season[, -1] # Remove reference category
colnames(one_hot_encode_season) <- c('Spring', 'Summer', 'Fall')
bike_data <- cbind(bike_data, one_hot_encode_season)

# Create lags
bike_data_new <- mutate(bike_data, lag1 = lag(log_cnt, 1), lag2 = lag(log_cnt, 2),
                        lag3 = lag(log_cnt, 3), lag4 = lag(log_cnt, 4), lag24 = lag(log_cnt, 24))

bike_data_new <- bike_data_new[-c(1:24),] # Lost 24 obs because of lagging

# Create training and test data
bike_all_data_train <- bike_data_new[bike_data_new$dteday >= as.Date("2011-01-01") & bike_data_new$dteday <=  as.Date("2012-05-31"), ]
bike_all_data_test <- bike_data_new[bike_data_new$dteday >= as.Date("2012-06-01") & bike_data_new$dteday <=  as.Date("2012-12-31"), ]
X_train <- bike_all_data_train[, c("lag1", "lag2",  "lag3", "lag4", "lag24")]
spline_basis <- ns(bike_all_data_train$hour, df = 10, intercept = FALSE)
X_train <- cbind(X_train, spline_basis)
colnames(X_train)[1] <- "intercept"
knots <- attr(spline_basis, "knots")
variables_to_keep_in_X <- c("yr", "holiday", "workingday", "temp", "atemp", "hum", "windspeed")
variables_to_keep_in_X <- c(variables_to_keep_in_X, colnames(one_hot_encode_weathersit), colnames(one_hot_encode_weekday), colnames(one_hot_encode_season))
X_train <- cbind(X_train, bike_all_data_train[, variables_to_keep_in_X])
head(X_train)
   intercept     lag2     lag3     lag4    lag24          1          2
25  3.663562 3.332205 3.526361 3.583519 2.772589 0.00000000 0.00000000
26  2.833213 3.663562 3.332205 3.526361 3.688879 0.01785714 0.00000000
27  2.833213 2.833213 3.663562 3.332205 3.465736 0.14285714 0.00000000
28  2.197225 2.833213 2.833213 3.663562 2.564949 0.41785714 0.01428571
29  1.791759 2.197225 2.833213 2.833213 0.000000 0.62857143 0.11428571
30  1.098612 1.791759 2.197225 2.833213 0.000000 0.34285714 0.58095238
            3 4 5 6 7            8           9           10 yr holiday
25 0.00000000 0 0 0 0  0.000000000 0.000000000  0.000000000  0       0
26 0.00000000 0 0 0 0 -0.116963312 0.350889936 -0.233926624  0       0
27 0.00000000 0 0 0 0 -0.175067667 0.525203001 -0.350135334  0       0
28 0.00000000 0 0 0 0 -0.138695850 0.416087550 -0.277391700  0       0
29 0.00000000 0 0 0 0 -0.065197614 0.195592842 -0.130395228  0       0
30 0.06666667 0 0 0 0 -0.002414726 0.007244179 -0.004829453  0       0
   workingday temp  atemp  hum windspeed cloudy light rain heavy rain Mon Tue
25          0 0.46 0.4545 0.88    0.2985      1          0          0   0   0
26          0 0.44 0.4394 0.94    0.2537      1          0          0   0   0
27          0 0.42 0.4242 1.00    0.2836      1          0          0   0   0
28          0 0.46 0.4545 0.94    0.1940      1          0          0   0   0
29          0 0.46 0.4545 0.94    0.1940      1          0          0   0   0
30          0 0.42 0.4242 0.77    0.2985      0          1          0   0   0
   Wed Thu Fri Sat Spring Summer Fall
25   0   0   0   0      0      0    0
26   0   0   0   0      0      0    0
27   0   0   0   0      0      0    0
28   0   0   0   0      0      0    0
29   0   0   0   0      0      0    0
30   0   0   0   0      0      0    0
# Training data
X_train <- as.matrix(X_train)
y_train <- bike_all_data_train$log_cnt
# Test data
y_test <- bike_all_data_test$log_cnt
X_test <- bike_all_data_test[, c("lag1", "lag2",  "lag3", "lag4", "lag24")]
spline_basis_test <- ns(bike_all_data_test$hour, df=10, knots=knots, intercept = FALSE)
X_test <- cbind(X_test, spline_basis_test)
colnames(X_test)[1] <- "intercept"
X_test <- cbind(X_test, bike_all_data_test[, variables_to_keep_in_X])
X_test <- as.matrix(X_test)

💪 Problem 2.1

Fit a deep learning model with three hidden layers to the bike rental data. The number of units should be, for each level respectively, 16 (first hidden layer), 8, and 4 (last hidden level). Use ReLU activation functions in all layers. You are free to choose optimisation method and settings, and you may add regularisation via dropout and/or early stopping and/or penalty.

Tip

Do not just copy code from Problem 1 without thinking about the settings you need to specify. Keep in mind that this is a regression problem (and not a classification problem!).

suppressMessages(library(keras))

model <- keras_model_sequential() %>%
  # Add first hidden layer with 16 units and ReLU activation
  layer_dense(units = 16, activation = 'relu', input_shape = c(ncol(X_train))) %>%
  layer_dropout(rate = 0.3) %>%
  
  # Add second hidden layer with 8 units
  layer_dense(units = 8, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  
  # Add third hidden layer with 4 units
  layer_dense(units = 4, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  
  layer_dense(units = 1)

# Compile the model
model %>% compile(
  loss = 'mse', 
  optimizer = optimizer_adam(), 
  metrics = c('mean_absolute_error')
)

# Early stopping to prevent overfitting
early_stopping <- callback_early_stopping(monitor = "val_loss", patience = 10, restore_best_weights = TRUE)

model_fit <- model %>% fit(
  X_train, y_train,
  epochs = 200,
  batch_size = 50,
  validation_split = 0.2,
  callbacks = list(early_stopping),
  verbose = 0
)

plot(model_fit)

results_test_model <- model %>% evaluate(X_test, y_test)
160/160 - 0s - loss: 0.3387 - mean_absolute_error: 0.4956 - 120ms/epoch - 751us/step
print(results_test_model)
               loss mean_absolute_error 
          0.3386662           0.4955903 

💪 Problem 2.2

Compute the RMSEs for the training and test data.

y_train_pred <- model %>% predict(X_train)
384/384 - 0s - 295ms/epoch - 769us/step
rmse_train <- sqrt(mean((y_train_pred - y_train)^2))

y_test_pred <- model %>% predict(X_test)
160/160 - 0s - 113ms/epoch - 703us/step
rmse_test <- sqrt(mean((y_test_pred - y_test)^2))

cat("RMSE for training data:", rmse_train, "\n")
RMSE for training data: 0.6214067 
cat("RMSE for test data:", rmse_test, "\n")
RMSE for test data: 0.5819503 

💪 Problem 2.3

Plot a time series plot of the response in the original scale (i.e. counts and not log-counts) for the last week of the test data (last \(24\times 7\) observations). In the same figure, plot a time series plot of the fitted values (in the original scale) from Problem 2.1. Comment on the fit.

suppressMessages(library(ggplot2))

# Convert log-count predictions back to original scale (counts)
y_test_pred_exp <- exp(y_test_pred)
y_test_exp <- exp(y_test)

last_week_indices <- (nrow(X_test) - 168 + 1):nrow(X_test)
y_test_last_week <- y_test_exp[last_week_indices]
y_test_pred_last_week <- y_test_pred_exp[last_week_indices]

# Create a data frame for plotting
time_series_data <- data.frame(
  Time = seq(1, 168),
  Actual = y_test_last_week,
  Predicted = y_test_pred_last_week
)

ggplot(time_series_data, aes(x = Time)) +
  geom_line(aes(y = Actual, color = "Actual")) +
  geom_line(aes(y = Predicted, color = "Predicted"), linetype = "dashed") +
  labs(title = "Bike Rental Counts: Actual vs. Predicted (Last Week of Test Data)",
       x = "Time (Hour)",
       y = "Bike Rentals (Counts)") +
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red"),
                     name = "Legend") +
  theme_minimal()

The model seems to be good at predicting peaks and valleys but not precise at all. It's not enough performant.

💪 Problem 2.4

Propose a better deep learning model than that in Problem 2.1. Add the predictions of your new model to the figure you created in Problem 2.3.

improved_model <- keras_model_sequential() %>%
  layer_dense(units = 36, activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 12, activation = 'relu') %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 4, activation = 'relu') %>%
  layer_dense(units = 1)

improved_model %>% compile(
  loss = 'mean_squared_error',
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = c('mean_absolute_error')
)

# Fit the improved model with early stopping
improved_history <- improved_model %>% fit(
  X_train, y_train,
  epochs = 100,
  validation_split = 0.2,
  callbacks = list(early_stopping),
  batch_size = 32,
  verbose = 0
)

y_test_pred_improved <- improved_model %>% predict(X_test)
160/160 - 0s - 193ms/epoch - 1ms/step
y_test_pred_exp_improved <- exp(y_test_pred_improved)

y_test_pred_last_week_improved <- y_test_pred_exp_improved[last_week_indices]

time_series_data$Predicted_Improved <- y_test_pred_last_week_improved

ggplot(time_series_data, aes(x = Time)) +
  geom_line(aes(y = Actual, color = "Actual")) +
  geom_line(aes(y = Predicted, color = "Predicted"), linetype = "dashed") +
  geom_line(aes(y = Predicted_Improved, color = "Predicted (Improved)"), linetype = "dotdash") +
  labs(title = "Bike Rental Counts: Actual vs. Predicted (Last Week of Test Data)",
       x = "Time (Hour)",
       y = "Bike Rentals (Counts)") +
  scale_color_manual(values = c("Actual" = "black", "Predicted" = "red", "Predicted (Improved)" = "blue"),
                     name = "Legend") +
  theme_minimal()

# Convert predictions back to original scale (counts)
y_test_pred_exp <- exp(y_test_pred)  # From Problem 2.3
y_test_pred_exp_improved <- exp(y_test_pred_improved)  # From Problem 2.4
y_test_exp <- exp(y_test)  # Actual counts

rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for Problem 2.1 model
rmse_2_1 <- rmse(y_test_exp, y_test_pred_exp)

# Calculate RMSE for Problem 2.4 model
rmse_2_4 <- rmse(y_test_exp, y_test_pred_exp_improved)

cat("RMSE for Problem 2.1 model:", rmse_2_1, "\n")
RMSE for Problem 2.1 model: 165.2331 
cat("RMSE for Problem 2.4 (improved) model:", rmse_2_4, "\n")
RMSE for Problem 2.4 (improved) model: 76.09536 

3. Deep learning for classifying images (4 marks)

Arguably, deep learning models first found their big success in image classification problems. The landmark application uses the MNIST dataset, which contains \(60000\) and \(10000\), respectively, training and test observations, where each observation consists of \(784\) gray-scale pixel intensities structured in a \(28\times28\) matrix. The pixel intensities range from 0-255, where 0 is (total) black and 255 is (total) white. The data are stored in a so-called array, which is a multi-dimensional data structure. The dataset is available in the keras and is loaded via the call dataset_mnist(). The following code loads and plot the data (requires rotation due to how the data are stored). For computational convenience, we will only use the first \(10000\) training observations (out of \(60000\)) for training the model.

rm(list=ls()) # Remove variables
cat("\014") # Clean workspace
suppressMessages(library(pracma)) # For image (matrix) rotation
suppressMessages(library(caret))
suppressMessages(library(keras))
suppressMessages(library(httr))

# Disable SSL verification
set_config(config(ssl_verifypeer = FALSE))
url <- "https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz"
response <- GET(url)
file_path <- "/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/3/mnist.npz"

writeBin(response$content, file_path)

mnist <- dataset_mnist(path = file_path)

X_train_array <- mnist$train$x[1:10000, , ]
dim(X_train_array) # 10000x28x28 3D array with 10000 images (each 28-by-28 pixels)
[1] 10000    28    28
y_train_array <- mnist$train$y[1:10000]
length(y_train_array) # 10000 element vector with training labels (0-9)
[1] 10000
X_test_array <- mnist$test$x
y_test_array <- mnist$test$y
# Plot the first training observation (rot90 rotates)
image(rot90(X_train_array[1, ,], -1), col = gray.colors(256, start = 1, end = 0))

# The label of the image above is a five
cat("The label is ", y_train_array[1], sep = "")
The label is 5

The following code re-scales the features (pixel intensities) to \([0, 1]\) and flattens the image from format \(28\times28\) to \(748\). The latter is because we will first fit a dense neural network that, as opposed to the convolutional neural network, does not consider the grid-like structure of the data. Moreover, the labels are transformed to one-hot code using the keras function to_categorical().

X_train <- array_reshape(X_train_array, c(nrow(X_train_array), 784)) # 10000x784 matrix
X_test <- array_reshape(X_test_array, c(nrow(X_test_array), 784))
# rescale to (0, 1)
X_train <- X_train / 255
X_test <- X_test / 255
# One-hot labels
y_train <- to_categorical(y_train_array, 10) # 10000x10 matrix, each row is one-hot (1 for the labelled class and the rest 0)
y_test <- to_categorical(y_test_array, 10)
print(y_train[1, ]) # Represent the label 5 (first element is the label 0)
 [1] 0 0 0 0 0 1 0 0 0 0

The following code builds and fits a two hidden layer dense network with, respectively, 256 and 128 units. See the code for specific settings. Note that we are not monitoring the AUC anymore because the classification problem has multiple classes. The code also predicts the class probabilities for the test images and inspects a special case where the classifier is very uncertain.

model_MNIST_2layer <- keras_model_sequential()
model_MNIST_2layer %>%
  # Add first hidden layer
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  # Add regularisation via dropout to the first hidden layer
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  # Add regularisation via dropout to the first hidden layer
  layer_dropout(rate = 0.3) %>%
  # Add layer that connects to the observations
  layer_dense(units = 10, activation = 'softmax')
summary(model_MNIST_2layer)
Model: "sequential_6"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_21 (Dense)                   (None, 256)                     200960      
 dropout_13 (Dropout)               (None, 256)                     0           
 dense_20 (Dense)                   (None, 128)                     32896       
 dropout_12 (Dropout)               (None, 128)                     0           
 dense_19 (Dense)                   (None, 10)                      1290        
================================================================================
Total params: 235146 (918.54 KB)
Trainable params: 235146 (918.54 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
# Compile model
model_MNIST_2layer %>% compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = c('accuracy'))

# Fit model
model_MNIST_2layer_fit <- model_MNIST_2layer %>% fit(X_train, y_train, epochs = 50, batch_size = 100, validation_split = 0.2, verbose = 0)
plot(model_MNIST_2layer_fit)

# Prediction test data. The ith row is the probability distribution over the classes (0-9) for the ith test image
y_pred_test_dl_2layer <- model_MNIST_2layer %>% predict(X_test)
313/313 - 0s - 380ms/epoch - 1ms/step
# For fun, get an observation where the prediction is not certain
indices <-  which(rowSums(y_pred_test_dl_2layer <= 0.55 & y_pred_test_dl_2layer >= 0.45) == 2) #  Gets observations for which the classifier is unsure (have two cells in the interval (0.45, 0.55).
ind <- indices[1] # Taking the first
barplot(names.arg = 0:9, y_pred_test_dl_2layer[ind, ], col = "cornflowerblue", ylim = c(0, 1), main = paste("Predicted probs of test image ", ind, sep = ""))

cat("Actual label: ", which.max(y_test[ind, ]) - 1, ", Predicted label:", which.max(y_pred_test_dl_2layer[ind, ]) - 1, sep = "")
Actual label: 6, Predicted label:6
image(rot90(X_test_array[ind, ,], -1), col = gray.colors(256, start = 1, end = 0))

💪 Problem 3.1

Is the model over- or underfitting the data? Explain. Given the same model structure, propose a fix to the issue and implement it.

The model is overfitting the data as the training accuracy is globally higher than the validation accuracy, and also the loss of the validation tends to getting higher for further iterations. Here, I propose a fix by adding early stopping and increasing dropout rates from 0.3 to 0.4, it helps prevent overfitting.
early_stopping <- callback_early_stopping(monitor = 'val_loss', patience = 5)

model_MNIST_fixed <- keras_model_sequential()
model_MNIST_fixed %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 10, activation = 'softmax')

model_MNIST_fixed %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

# Fit the model with early stopping
model_MNIST_fixed_fit <- model_MNIST_fixed %>% fit(
  X_train, y_train,
  epochs = 50,
  batch_size = 100,
  validation_split = 0.2,
  callbacks = list(early_stopping),
  verbose = 0
)

plot(model_MNIST_fixed_fit)

model_MNIST_fixed %>% evaluate(X_test, y_test)
313/313 - 1s - loss: 0.1722 - accuracy: 0.9571 - 536ms/epoch - 2ms/step
     loss  accuracy 
0.1721519 0.9571000 
y_pred_test_dl_fixed <- model_MNIST_fixed %>% predict(X_test)
313/313 - 0s - 381ms/epoch - 1ms/step
summary(model_MNIST_fixed)
Model: "sequential_7"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_24 (Dense)                   (None, 256)                     200960      
 dropout_15 (Dropout)               (None, 256)                     0           
 dense_23 (Dense)                   (None, 128)                     32896       
 dropout_14 (Dropout)               (None, 128)                     0           
 dense_22 (Dense)                   (None, 10)                      1290        
================================================================================
Total params: 235146 (918.54 KB)
Trainable params: 235146 (918.54 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________

We now consider including 2D convolutional layers, which are tailored for when the features have a grid-like structure, such as \(28\times 28\) images above (before they were flattened). The data structure in convolutional filters has an additional dimension known as the number of channels. This is essentially the dimension of the output in a pixel. In gray-scale color as in the MNIST data, there is only one channel - the white pixel intensity (with outcomes 0-255, 0 is (total) black and 255 is (total) white). Another example is the RGB color scheme, which has three channels since the pixel intensity is three dimensional (Red, Green, Blue), where each of the colors have a corresponding pixel intensity with outcomes 0-255. For example, the triple (0, 255, 0) represents the color green, and (255, 255, 0) represents the color yellow. The following code shows how to create a dataset suitable for input in a convolutional filter for the MNIST data. Some explanation of the code follows.

X_train <- array(X_train_array, c(10000, 28, 28, 1)) # The last dimension is the channel
X_test <- array(X_test_array, c(10000, 28, 28, 1))
# Transform values into [0,1] range
X_train <- X_train / 255
X_test <- X_test / 255
# One-hot labels
y_train <- to_categorical(y_train_array, 10) # 10000x10 matrix, each row is one-hot (1 for the labelled class and the rest 0)
y_test <- to_categorical(y_test_array, 10)

# Define model
model_MNIST_2conv1layer <- keras_model_sequential() %>%
  # First convolutional layer
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = c(28, 28, 1)) %>%
  # Second convolutional layer
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
  # Add a pooling layer after the second convolutional layer
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  # Add regularisation via dropout to the second convolutional layer
  layer_dropout(rate = 0.4) %>%
  # Flatten the output of the preceeding layer
  layer_flatten() %>%
  # A third layer fully connected (input has been flattened)
  layer_dense(units = 128, activation = 'relu') %>%
  # Add regularisation via dropout to preceeding layer
  layer_dropout(rate = 0.4) %>%
  # Add layer that connects to the observations
  layer_dense(units = 10, activation = 'softmax')
# Compile model
model_MNIST_2conv1layer %>% compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = c('accuracy'))
model_MNIST_2conv1layer_fit <- model_MNIST_2conv1layer %>% fit(X_train, y_train, batch_size = 100, epochs = 15, validation_split = 0.2, verbose = 0)
plot(model_MNIST_2conv1layer_fit)

y_pred_test_dl_2conv1layer <- model_MNIST_2conv1layer %>% predict(X_test)
313/313 - 2s - 2s/epoch - 6ms/step

We only explain the calls that have not been explained previously. The layer_conv_2d() call adds a layer with the number of filters specified by the argument filter, where each filter has the size specified by the argument kernel_size. The layer_max_pooling() call adds a layer that down-samples the output of the preceding one, using a special filter without trainable parameters. This special filter takes the maximum value over a region (with size specified by the pool_size argument) of the previous output, see Figure 6.14 in Lindholm et al. (2022). Finally, the layer_flatten() call flattens the output of the preceding layer. The reason for doing this is to allow the next layer to be fully connected.

💪 Problem 3.2

Compare the deep learning model with convolutional layers to that in Problem 3.1. Discuss the results.

plot_comparison <- function(history_dense, history_conv) {
  suppressMessages(library(ggplot2))
  
  # Extract training history data for both models
  history_dense_df <- data.frame(
    epoch = seq_len(length(history_dense$metrics$val_loss)),
    accuracy_dense = history_dense$metrics$accuracy,
    val_accuracy_dense = history_dense$metrics$val_accuracy,
    loss_dense = history_dense$metrics$loss,
    val_loss_dense = history_dense$metrics$val_loss
  )
  
  history_conv_df <- data.frame(
    epoch = seq_len(length(history_conv$metrics$val_loss)),
    accuracy_conv = history_conv$metrics$accuracy,
    val_accuracy_conv = history_conv$metrics$val_accuracy,
    loss_conv = history_conv$metrics$loss,
    val_loss_conv = history_conv$metrics$val_loss
  )

  # Combine data for accuracy plot
  accuracy_df <- merge(history_dense_df, history_conv_df, by = "epoch", all = TRUE)
  
  ggplot(accuracy_df, aes(x = epoch)) +
    geom_line(aes(y = accuracy_dense, color = "Dense Model Training Accuracy")) +
    geom_line(aes(y = val_accuracy_dense, color = "Dense Model Validation Accuracy"), linetype = "dashed") +
    geom_line(aes(y = accuracy_conv, color = "Conv Model Training Accuracy")) +
    geom_line(aes(y = val_accuracy_conv, color = "Conv Model Validation Accuracy"), linetype = "dashed") +
    labs(title = "Training vs. Validation Accuracy: Dense vs Conv Model",
         x = "Epoch", y = "Accuracy") +
    scale_color_manual(values = c("Dense Model Training Accuracy" = "blue", 
                                  "Dense Model Validation Accuracy" = "blue", 
                                  "Conv Model Training Accuracy" = "green",
                                  "Conv Model Validation Accuracy" = "green"),
                       name = "Legend") +
    theme_minimal()

  # Combine data for loss plot
  loss_df <- merge(history_dense_df, history_conv_df, by = "epoch", all = TRUE)

  ggplot(loss_df, aes(x = epoch)) +
    geom_line(aes(y = loss_dense, color = "Dense Model Training Loss")) +
    geom_line(aes(y = val_loss_dense, color = "Dense Model Validation Loss"), linetype = "dashed") +
    geom_line(aes(y = loss_conv, color = "Conv Model Training Loss")) +
    geom_line(aes(y = val_loss_conv, color = "Conv Model Validation Loss"), linetype = "dashed") +
    labs(title = "Training vs. Validation Loss: Dense vs Conv Model",
         x = "Epoch", y = "Loss") +
    scale_color_manual(values = c("Dense Model Training Loss" = "blue", 
                                  "Dense Model Validation Loss" = "blue", 
                                  "Conv Model Training Loss" = "red",
                                  "Conv Model Validation Loss" = "red"),
                       name = "Legend") +
    theme_minimal()
}

plot_comparison(model_MNIST_2layer_fit, model_MNIST_2conv1layer_fit)
Warning: Removed 35 rows containing missing values or values outside the scale range
(`geom_line()`).
Removed 35 rows containing missing values or values outside the scale range
(`geom_line()`).

We can conclude that the convolutional model converges much quicker than the dense model. So the convolutional model performs better in this case.

💪 Problem 3.3

Just before Problem 3.1, we inspected a special case where the classifier (based on the dense layers) was very uncertain. Compute the predicted class probabilities of that particular case with the new model (based on the convolutional filters) and compare to the previous result.

# Find the indices where the dense model's predictions were uncertain (between 0.45 and 0.55 for two classes)
indices_dense <- which(rowSums(y_pred_test_dl_2layer <= 0.55 & y_pred_test_dl_2layer >= 0.45) == 2)
ind <- indices_dense[1] # Use the first uncertain case

# Predict the class probabilities for the test data using the convolutional model
y_pred_test_dl_2conv1layer <- model_MNIST_2conv1layer %>% predict(X_test)
313/313 - 2s - 2s/epoch - 6ms/step
# Extract the predicted probabilities for the specific uncertain case
y_pred_conv_special_case <- y_pred_test_dl_2conv1layer[ind, ]

# Plot comparison of predicted probabilities for the uncertain case
suppressMessages(library(ggplot2))

# Create a data frame for plotting
pred_comparison <- data.frame(
  Class = 0:9,
  Dense = y_pred_test_dl_2layer[ind, ],
  Conv = y_pred_conv_special_case
)

# Melt the data frame to long format for ggplot
suppressMessages(library(reshape2))
pred_comparison_long <- melt(pred_comparison, id = "Class")

# Plot the predicted probabilities from both models
ggplot(pred_comparison_long, aes(x = as.factor(Class), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = paste("Predicted Class Probabilities for Test Image", ind),
       x = "Class", y = "Probability",
       fill = "Model") +
  scale_fill_manual(values = c("Dense" = "blue", "Conv" = "red")) +
  theme_minimal()

# Print the actual label for the test image
cat("Actual label: ", which.max(y_test[ind, ]) - 1, "\n")
Actual label:  6 
# Print the predicted label from the dense model
cat("Dense model predicted label: ", which.max(y_pred_test_dl_2layer[ind, ]) - 1, "\n")
Dense model predicted label:  6 
# Print the predicted label from the convolutional model
cat("Convolutional model predicted label: ", which.max(y_pred_conv_special_case) - 1, "\n")
Convolutional model predicted label:  6 
By comparing the results, we can see that the convolutional model performs much better than the previous one. When the dense model is struggle between 2 classes, the convolutional model gives a extremely high class probability on only 1 class which means it has higher accuracy.

Let us now consider another image dataset with an RGB color scheme. There are 10 classes, consisting of animals and objects, and the objective is to train a neural network that takes a 3 channel input of \(32\times 32\) pixels. The trained model can then be evaluated on a set of \(10000\) test images. The original data contain \(50000\) training samples, but we keep \(10000\) for computational convenience. The following code reads in the data, formats the features, and plots one of the images.

suppressMessages(library(grid))
suppressMessages(library(reticulate))
suppressMessages(library(keras))

# Import Python modules
np <- import("numpy")
pickle <- import("pickle")
builtins <- import_builtins()  # Access Python's built-in functions

# Load CIFAR-10 training batch file
train_file_path <- "/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/3/cifar-10-batches-py/data_batch_1"

with(builtins$open(train_file_path, "rb") %as% f, {
  train_batch <- pickle$load(f, encoding = "bytes")
})

# Access training data and labels
data_key <- "b'data'"
labels_key <- "b'labels'"

train_images <- train_batch[[data_key]]
train_labels <- train_batch[[labels_key]]

# Reshape training images to 10000x32x32x3
if (exists("train_images")) {
  train_images <- np$reshape(as.integer(train_images), c(10000L, 32L, 32L, 3L))  # Cast to integer explicitly
}

# Rescale training images to (0, 1)
X_train <- train_images / 255 
y_train_array <- train_labels

class_names <- c('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

y_train <- to_categorical(y_train_array, 10)

# Load the CIFAR-10 test batch file
test_file_path <- "/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/3/cifar-10-batches-py/test_batch"

with(builtins$open(test_file_path, "rb") %as% f, {
  test_batch <- pickle$load(f, encoding = "bytes")
})

test_images <- test_batch[[data_key]]
test_labels <- test_batch[[labels_key]]

if (exists("test_images")) {
  test_images <- np$reshape(as.integer(test_images), c(10000L, 32L, 32L, 3L))
}

X_test <- test_images / 255
y_test_array <- test_labels
y_test_labels <- class_names[y_test_array + 1] 

y_train_labels <- class_names[y_train_array + 1]
obs_to_plot <- which(y_train_labels == "dog")[1]
image_nbr <- obs_to_plot
rgb_image <- rgb(X_train[image_nbr, , , 1], X_train[image_nbr, , , 2], X_train[image_nbr, , , 3])
dim(rgb_image) <- dim(X_train[image_nbr, , , 1])
grid.newpage()
grid.raster(rgb_image, interpolate = FALSE)

cat('Image is a ', y_train_labels[image_nbr], sep = "")
Image is a dog
Tips: This image doesn't show correct RGB colors, but we can more or less see a shape of a dog's face. Probably due to the CIFAR-10 file issues, as I wasn't able to download it directly because of SSL problems.

💪 Problem 3.4

Fit a neural network with at least two hidden convolutional layers to the data above. You are free to choose the settings, such as regularisation (dropout and/or early stopping and/or penalty), validation split, optimiser, etc.

Note

This classification problem is significantly harder than MNIST. You can expect about \(60\text{-}65\%\) accuracy on the test set (not too bad compared to a random classifier which would have \(10\%\)). This can be improved to around \(80\text{-}85\%\) if all training samples are used, however, we do not pursue this here due to computational convenience.

suppressMessages(library(keras))

model_cifar10 <- keras_model_sequential() %>%
  # First convolutional layer
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = 'relu', input_shape = c(32, 32, 3)) %>%
  # Second convolutional layer
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  # Max-pooling layer
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  # Dropout for regularization
  layer_dropout(rate = 0.3) %>%
  # Third convolutional layer
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  # Dropout for regularization
  layer_dropout(rate = 0.4) %>%
  # Flatten the output for fully connected layers
  layer_flatten() %>%
  # Fully connected layer
  layer_dense(units = 128, activation = 'relu') %>%
  # Dropout for regularization
  layer_dropout(rate = 0.5) %>%
  # Output layer (softmax for classification)
  layer_dense(units = 10, activation = 'softmax')

model_cifar10 %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

early_stopping <- callback_early_stopping(monitor = 'val_loss', patience = 5)

model_cifar10_fit <- model_cifar10 %>% fit(
  X_train, y_train,
  epochs = 50,
  batch_size = 128,
  validation_split = 0.2,
  callbacks = list(early_stopping),
  verbose = 0
)

plot(model_cifar10_fit)

model_cifar10 %>% evaluate(X_test, y_test)
313/313 - 3s - loss: 2.3031 - accuracy: 0.0958 - 3s/epoch - 10ms/step
    loss accuracy 
2.303061 0.095800 
y_pred_test_cifar10 <- model_cifar10 %>% predict(X_test)
313/313 - 3s - 3s/epoch - 10ms/step
obs_to_plot <- which(y_train_array == 5)[1] 
image_nbr <- obs_to_plot
rgb_image <- rgb(X_train[image_nbr, , , 1], X_train[image_nbr, , , 2], X_train[image_nbr, , , 3])
dim(rgb_image) <- dim(X_train[image_nbr, , , 1])
grid.newpage()
grid.raster(rgb_image, interpolate = FALSE)

cat('Image is a ', class_names[which.max(y_train[obs_to_plot, ])], sep = "")
Image is a dog
cat('Predicted label: ', class_names[which.max(y_pred_test_cifar10[obs_to_plot, ])], "\n")
Predicted label:  frog 
Also, due to pixel's problem, the model doesn't perform normally.

💪 Problem 3.5

Compute the confusion matrix for the test data. Out of the images that are horses, which two predicted classes are the most common when the classifier is wrong?

suppressMessages(library(caret))

y_pred_prob <- model_cifar10 %>% predict(X_test)
313/313 - 3s - 3s/epoch - 11ms/step
y_pred <- apply(y_pred_prob, 1, which.max) - 1

y_true <- factor(y_test_array, levels = 0:9, labels = class_names)
y_pred_factor <- factor(y_pred, levels = 0:9, labels = class_names)

conf_matrix <- confusionMatrix(y_pred_factor, y_true)
print(head(conf_matrix))
$positive
NULL

$table
          Reference
Prediction plane  car bird  cat deer  dog frog horse ship truck
     plane     0    0    0    0    0    0    0     0    0     0
     car       0    0    0    0    0    0    0     0    0     0
     bird      0    0    0    0    0    0    0     0    0     0
     cat       0    0    0    0    0    0    0     0    0     0
     deer      0    0    0    0    0    0    0     0    0     0
     dog       0    0    0    0    0    0    0     0    0     0
     frog   1000 1000 1000 1000 1000 1000 1000  1000 1000  1000
     horse     0    0    0    0    0    0    0     0    0     0
     ship      0    0    0    0    0    0    0     0    0     0
     truck     0    0    0    0    0    0    0     0    0     0

$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
    0.10000000     0.00000000     0.09418721     0.10604690     0.10000000 
AccuracyPValue  McnemarPValue 
    0.50487592            NaN 

$byClass
             Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: plane           0           1            NaN            0.9        NA
Class: car             0           1            NaN            0.9        NA
Class: bird            0           1            NaN            0.9        NA
Class: cat             0           1            NaN            0.9        NA
Class: deer            0           1            NaN            0.9        NA
Class: dog             0           1            NaN            0.9        NA
Class: frog            1           0            0.1            NaN       0.1
Class: horse           0           1            NaN            0.9        NA
Class: ship            0           1            NaN            0.9        NA
Class: truck           0           1            NaN            0.9        NA
             Recall        F1 Prevalence Detection Rate Detection Prevalence
Class: plane      0        NA        0.1            0.0                    0
Class: car        0        NA        0.1            0.0                    0
Class: bird       0        NA        0.1            0.0                    0
Class: cat        0        NA        0.1            0.0                    0
Class: deer       0        NA        0.1            0.0                    0
Class: dog        0        NA        0.1            0.0                    0
Class: frog       1 0.1818182        0.1            0.1                    1
Class: horse      0        NA        0.1            0.0                    0
Class: ship       0        NA        0.1            0.0                    0
Class: truck      0        NA        0.1            0.0                    0
             Balanced Accuracy
Class: plane               0.5
Class: car                 0.5
Class: bird                0.5
Class: cat                 0.5
Class: deer                0.5
Class: dog                 0.5
Class: frog                0.5
Class: horse               0.5
Class: ship                0.5
Class: truck               0.5

$mode
[1] "sens_spec"

$dots
list()
horse_indices <- which(y_test_array == 7)
wrong_preds <- y_pred[horse_indices][y_pred[horse_indices] != 7]

# Find the two most common misclassifications
top_misclass <- sort(table(wrong_preds), decreasing = TRUE)[1:2]
top_misclass_names <- class_names[as.integer(names(top_misclass)) + 1]

cat("Top two misclassified classes for horse images:", top_misclass_names, "\n")
Top two misclassified classes for horse images: frog NA 

💪 Problem 3.6

Find an image in the test data set that the classifier is uncertain about (i.e. no class has close to probability 1). Plot the image and, moreover, plot the predictive distribution of that image (a bar plot with the probability for each of the classes). Did your classifier end up taking the right decision?

y_pred_probs <- model_cifar10 %>% predict(X_test)
313/313 - 3s - 3s/epoch - 10ms/step
y_pred_classes <- apply(y_pred_probs, 1, which.max) - 1

# Find an uncertain prediction: where max probability < 0.5
uncertain_idx <- which(apply(y_pred_probs, 1, max) < 0.5)[1]

uncertain_image <- X_test[uncertain_idx, , , ]
uncertain_probs <- y_pred_probs[uncertain_idx, ]

grid.newpage()
grid.raster(uncertain_image, interpolate = FALSE)

cat("Uncertain Prediction Index:", uncertain_idx, "\n")
Uncertain Prediction Index: 1 
# True label vs. Predicted label
true_label <- class_names[y_test_array[uncertain_idx] + 1]
predicted_label <- class_names[y_pred_classes[uncertain_idx] + 1]
cat("True Label:", true_label, "\n")
True Label: cat 
cat("Predicted Label:", predicted_label, "\n")
Predicted Label: frog 
barplot(
  uncertain_probs, 
  names.arg = class_names, 
  col = "skyblue", 
  las = 2, 
  ylim = c(0, 1), 
  ylab = "Probability", 
  main = "Predictive Distribution for Uncertain Image"
)

Unfortunately, due to huge uncertainty, the classifier didn't the right decision.

4. Gaussian process prior (3 marks)

A Gaussian process is a non-parametric approach to model a function \(f(\mathbf{x}),\mathbf{x}\in\mathbb{R}^p,f:\mathbb{R}^p\rightarrow \mathbb{R}\). To simplify the implementation in this computer lab, we assume that \(p=1\), i.e. we model a function of a one-dimensional input \(x\).

A corner building stone for Gaussian processes is the so-called kernel. A kernel function specifies how correlated the function values at two inputs, \(x\) and \(x^\prime\) , i.e. \(f(x)\) and \(f(x^\prime)\) , are. Typically, the closer \(x\) and \(x^\prime\) are, the stronger the correlation between \(f(x)\) and \(f(x^\prime)\) . A kernel is said to be stationary if its value depends only on the distance between \(x\) and \(x^\prime\), and not where the inputs are located. One of the simplest yet useful stationary kernels is the squared exponential covariance kernel, which for one-dimensional inputs is defined as \[k(x,x^\prime)=\mathrm{Cov}\left(f(x),f(x^\prime)\right)=\sigma_f^2\exp\left(\frac{1}{2\ell^2}(x-x^\prime)^2\right),\]

where \(\sigma_f^2\) and \(\ell>0\) are, respectively, the scale and the length scale. The scale controls the variance of the kernel and the length scale controls how fast the covariance between \(f(x)\) and \(f(x^\prime)\) (i.e. \(k(x,x^\prime)\)) decays as a function of the distance between \(x\) and \(x^\prime\).

The following function computes the squared exponential kernel \(k(x,x^\prime)\) for any two one-dimensional inputs \(x\) and \(x^\prime\) for given \(\sigma_f^2\) and \(\ell\).

pairwise_cov_squared_exp <- function(x, x_prime, sigma_f, ell) {
  return(sigma_f^2*exp(-1/(2*ell^2)*(x - x_prime)^2))
}

💪 Problem 4.1

Assume \(\sigma_f=1.5\) and \(\ell=0.5\). Use the function above to compute:

  1. The covariance between 0.3 and 0.7.

  2. The covariance between 0.1 and 0.5.

  3. The correlation between -0.2 and -0.5.

Explain why the covariances in 1. and 2. are the same.

Tip

To compute 3., think about how the correlation relates to the covariance.

sigma_f <- 1.5
ell <- 0.5

cov_03_07 <- pairwise_cov_squared_exp(0.3, 0.7, sigma_f, ell)

cov_01_05 <- pairwise_cov_squared_exp(0.1, 0.5, sigma_f, ell)

cov_minus02_minus05 <- pairwise_cov_squared_exp(-0.2, -0.5, sigma_f, ell)
var_minus02 <- pairwise_cov_squared_exp(-0.2, -0.2, sigma_f, ell)
var_minus05 <- pairwise_cov_squared_exp(-0.5, -0.5, sigma_f, ell)

corr_minus02_minus05 <- cov_minus02_minus05 / sqrt(var_minus02 * var_minus05)

cat("Covariance between 0.3 and 0.7:", cov_03_07, "\n")
Covariance between 0.3 and 0.7: 1.633835 
cat("Covariance between 0.1 and 0.5:", cov_01_05, "\n")
Covariance between 0.1 and 0.5: 1.633835 
cat("Correlation between -0.2 and -0.5:", corr_minus02_minus05, "\n")
Correlation between -0.2 and -0.5: 0.8352702 
Because the inputs of 1. and 2. are spaced equally in both instances (0.4 units apart), the covariances are the same, aligning with the stationary nature of the squared exponential kernel.

Assume we observe \(n\) inputs, \(x_1,x_2,\dots,x_n\). The kernel matrix is defined as

\[ \mathbf{K}(\mathbf{X},\mathbf{X})=\left[\begin{array}{cccc} k(x_{1},x_{1}) & k(x_{1},x_{2}) & \cdots & k(x_{1},x_{2})\\ k(x_{2},x_{1}) & k(x_{2},x_{2}) & \cdots & k(x_{2},x_{n})\\ \vdots & \vdots & \ddots & \vdots\\ k(x_{n},x_{1}) & k(x_{n},x_{2}) & \cdots & k(x_{n},x_{n}) \end{array}\right], \]

where \(\mathbf{X}=(x_1,x_2,\dots,x_n)^\top \in \mathbb{R}^n\). Note that the \(i,j\)th element in the kernel matrix contains the pairwise covariance between \(f(x_i)\) and \(f(x_j)\).

💪 Problem 4.2

Compute the kernel matrix (on the input values specified below) with the help of the pairwise_cov_squared_exp() function. Interpret the value of row 2 and column 5 in the kernel matrix. Use the following input values when computing the kernel matrix:

X <- seq(-1, 1, length.out = 21)
Tip

Create a matrix of zeros to store the kernel matrix in and fill its values using a double for-loop.

X <- seq(-1, 1, length.out = 21)

n <- length(X)
K <- matrix(0, n, n)

# Compute the kernel matrix using a double for-loop
for (i in 1:n) {
  for (j in 1:n) {
    K[i, j] <- pairwise_cov_squared_exp(X[i], X[j], sigma_f, ell)
  }
}

#print(K)
# Display the top-left 5x5 portion of the kernel matrix
K_submatrix <- K[1:5, 1:5]
print(K_submatrix)
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 2.250000 2.205447 2.077012 1.879358 1.633835
[2,] 2.205447 2.250000 2.205447 2.077012 1.879358
[3,] 2.077012 2.205447 2.250000 2.205447 2.077012
[4,] 1.879358 2.077012 2.205447 2.250000 2.205447
[5,] 1.633835 1.879358 2.077012 2.205447 2.250000
row_2_col_5 <- K[2, 5]
cat("Value at row 2, column 5 of the kernel matrix:", row_2_col_5, "\n")
Value at row 2, column 5 of the kernel matrix: 1.879358 
cat("This value represents the covariance between the function values f(X[2]) and f(X[5]), where X[2] =", X[2], "and X[5] =", X[5], ".\n")
This value represents the covariance between the function values f(X[2]) and f(X[5]), where X[2] = -0.9 and X[5] = -0.6 .

The way of constructing the kernel matrix above, i.e. using a double for-loop that calls the function pairwise_cov_squared_exp, gives a solid understanding of what the kernel matrix is. However, unfortunately, it becomes computationally expensive, even for a moderate value of \(n\). The following code computes the kernel matrix for a squared exponential kernel avoiding for-loops via the function outer(). This function applies a binary function (specified by the argument FUN, in this case minus) to all possible pairs of elements from the two input vectors, and returns a matrix containing the results. We refer to this as vectorisation, since only vector operations are used.

kernel_matrix_squared_exp <- function(X, Xstar, sigma_f, ell) {
  # Computes the kernel matrix for the squared exponential kernel model
  # Compute the pairwise squared Euclidean distances
  pairwise_squared_distances <- outer(X, Xstar, FUN = "-")^2
  # Compute the kernel matrix element-wise
  kernel_matrix <- sigma_f^2*exp(-1/(2*ell^2)*pairwise_squared_distances)
  return(kernel_matrix)
}

💪 Problem 4.3

Compare the computing times between the two ways of constructing the kernel matrix (i.e. the double for-loop vs the vectorised version). Use the input vector X<-seq(-1,1,length.out=500) when comparing the computing times.

Tip

There are many ways to measure execution time in R. One alternative may be the benchmark() function from the rbenchmark package.

suppressMessages(library(rbenchmark))

X <- seq(-1, 1, length.out = 500)

kernel_matrix_loop <- function(X, sigma_f, ell) {
  n <- length(X)
  K <- matrix(0, n, n)
  for (i in 1:n) {
    for (j in 1:n) {
      K[i, j] <- pairwise_cov_squared_exp(X[i], X[j], sigma_f, ell)
    }
  }
  return(K)
}

# Compare execution time between the two methods
benchmark(
  "Double for-loop" = kernel_matrix_loop(X, sigma_f, ell),
  "Vectorized (outer)" = kernel_matrix_squared_exp(X, X, sigma_f, ell),
  replications = 10
)
                test replications elapsed relative user.self sys.self
1    Double for-loop           10   3.052   46.242     3.029    0.011
2 Vectorized (outer)           10   0.066    1.000     0.051    0.015
  user.child sys.child
1          0         0
2          0         0
The vectorized method is significantly faster than the double for-loop method.

The vectorised version of computing the kernel matrix is significantly more efficient and will be used for the rest of the lab (no more double for-loops!).

We are now ready to define a Gaussian process, abbreviated GP. A GP specifies a prior distribution over functions \(f(x)\). At any given \(x\), \(f(x)\) is treated as a random variable and assigned the prior \(f(x)\sim \mathcal{N}\left(m(x),\sigma_f^2\right)\), where \(m(x)\) and \(\sigma_f^2\) are the prior, respectively, mean and variance. Note that since \(x\) is continuous there are infinitely many (uncountably many!) random variables (one \(f(x)\) for each \(x\)!). Moreover, any two \(f(x)\) and \(f(x^\prime)\) have covariance \(k(x,x^\prime)\).

Gaussian Process

A Gaussian process is an infinite collection of random variables, of which any finite set of them have a multivariate Gaussian distribution.

Denote the vector consisting of a finite set of random variables by \[\mathbf{f}(\mathbf{X})=(f(x_1), f(x_2), \dots, f(x_n))^\top,\]

and the corresponding prior mean

\[\mathbf{m}(\mathbf{X})=(m(x_1), m(x_2), \dots, m(x_n))^\top.\]

Then the prior distribution of the finite set of random variables is multivariate normal

\[\mathbf{f}(\mathbf{X}) \sim \mathcal{N}\left(\mathbf{m}(\mathbf{X}),\mathbf{K}(\mathbf{X}, \mathbf{X}) \right).\]

The following code simulates 5 samples from this Gaussian process prior distribution with the prior mean \(\mathbf{m}(\mathbf{X})\) set to zero, and the prior covariance matrix equal to the kernel matrix \(\mathbf{K}(\mathbf{X}, \mathbf{X})\) for given \(\sigma_f\) and \(\ell\). To get a high resolution of the process, i.e. the function, we create a fine grid of \(x\) values (stored in x_grid).

suppressMessages(library(mvtnorm)) # for multivariate normal
n_grid <- 200
X_grid <- seq(-1, 1, length.out = n_grid)
sigma_f <- 1
ell <- 0.3
m_X <- rep(0, n_grid) # Create zero vector
K_X_X <- kernel_matrix_squared_exp(X_grid, X_grid, sigma_f, ell)
GP_realisations <- rmvnorm(n = 5, mean = m_X, sigma = K_X_X)
matplot(X_grid, t(GP_realisations), type = "l", lty = 1, col = c("cornflowerblue", "lightcoral", "green", "black", "purple"), xlab = "x", ylab = "f(x)", main = "Simulations from the GP prior", xlim=c(-1, 1.5), ylim=c(-3*sigma_f, 3*sigma_f))
legend("topright", legend = c("Sim 1", "Sim 2", "Sim 3", "Sim 4", "Sim 5"), col = c("cornflowerblue", "lightcoral", "green", "black", "purple"), lty = 1)

💪 Problem 4.4

Play around with the length scale \(\ell\) in the code above. Discuss the role of the length scale and its implication for the bias-variance trade off.

ell_values <- c(0.1, 0.5, 1.0)

for (ell in ell_values) {
  K_X_X <- kernel_matrix_squared_exp(X_grid, X_grid, sigma_f, ell)
  GP_realisations <- rmvnorm(n = 5, mean = m_X, sigma = K_X_X)
  
  matplot(X_grid, t(GP_realisations), type = "l", lty = 1, 
          col = c("cornflowerblue", "lightcoral", "green", "black", "purple"),
          xlab = "x", ylab = "f(x)", 
          main = paste("Simulations from the GP prior (ell =", ell, ")"), 
          xlim = c(-1, 1.5), ylim = c(-3*sigma_f, 3*sigma_f))
  
  legend("topright", legend = c("Sim 1", "Sim 2", "Sim 3", "Sim 4", "Sim 5"), 
         col = c("cornflowerblue", "lightcoral", "green", "black", "purple"), 
         lty = 1)
}

A small length scale l causes the GP prior functions to fluctuate rapidly over short distances. A small value of l results in low bias and high variance, indicating greater flexibility but a higher likelihood of overfitting. On the other hand, a large value of l leads to high bias and low variance, implying less flexibility but smoother and more stable outcomes.

Finally, note that we can get the marginal variances of the elements in \(\mathbf{f}(\mathbf{X})\) by considering the diagonal of the matrix \(\mathbf{K}(\mathbf{X}, \mathbf{X})\). It is not hard to show that these are \(\sigma_f^2\). We can now construct a probability interval for the Gaussian process prior, since we know that \(f(x)\sim \mathcal{N}\left(m(x),\sigma_f^2\right)\), \[\left[m(x)-1.96\sigma_f, m(x)+1.96\sigma_f \right]\]

is an approximate \(95\%\) interval. The following code redoes the plot above and adds the interval by plotting a shaded area using a polygon.

matplot(X_grid, t(GP_realisations), type = "l", lty = 1, col = c("cornflowerblue", "lightcoral", "green", "black", "purple"), xlab = "x", ylab = "f(x)", main = "Simulations from the GP prior", xlim=c(-1, 1.5), ylim=c(-3*sigma_f, 3*sigma_f))
legend("topright", legend = c("Sim 1", "Sim 2", "Sim 3", "Sim 4", "Sim 5"), col = c("cornflowerblue", "lightcoral", "green", "black", "purple"), lty = 1)

# Define x values for shading (X_grid for this example)
x_shade <- X_grid
# Lower and upper interval (prior mean is zero)
lower_interval <- -1.96*sigma_f*rep(1, n_grid)
upper_interval <- 1.96*sigma_f*rep(1, n_grid)

# Create a polygon to shade the prediction interval (alpha controls transparency)
polygon(c(x_shade, rev(x_shade)), c(lower_interval, rev(upper_interval)), col = rgb(0, 0, 1, alpha = 0.05), border = NA)

5. Gaussian process posterior (4 marks)

The previous section defined the Gaussian process prior as a prior distribution over an unknown function \(f(x)\). The Gaussian process approach to regression assumes that the response \(y\) follows \[y = f(x)+\varepsilon,\]where \(f(x)\) follows a Gaussian process prior, i.e. \(f(x)\sim\mathcal{GP}\left(m(x), k(x,x^\prime)\right)\), and \(\varepsilon\sim N(0,\sigma_{\varepsilon}^2)\) independent of \(f(x)\). Consider some training data \(\mathbf{y}=(y_1, \dots,y_n)^\top\) and \(\mathbf{X}=(x_1, \dots,x_n)^\top\) . Writing \(\boldsymbol{\varepsilon}=(\varepsilon_1,\dots,\varepsilon_n)^\top\), we can express the model in vector form as

\[\mathbf{y} = \mathbf{f}(\mathbf{X})+\boldsymbol{\varepsilon}.\]

It is common to use the shorthand notation \(\mathbf{f}\) instead of \(\mathbf{f}(\mathbf{X})\).

💪 Problem 5.1

Derive (analytically) \(\mathbb{E}\left(\mathbf{y}\right)\) and \(\mathrm{Cov}\left(\mathbf{y}\right)\).

Tip

The tower property of expectations is

\[ \mathbb{E}\left(\mathbf{y}\right)=\mathbb{E}_\mathbf{f}\left(\mathbb{E}\left(\mathbf{y}|\mathbf{f}\right)\right). \]

The law of total covariance \[\mathrm{Cov}\left(\mathbf{y}\right)= \mathbb{E}_\mathbf{f}\left(\mathrm{Cov}\left(\mathbf{y}|\mathbf{f}\right)\right)+\mathrm{Cov}_\mathbf{f}\left(\mathbb{E}\left(\mathbf{y}|\mathbf{f}\right)\right).\]

The expectation and covariance of the inner expressions are with respect to the distribution of \(\mathbf{y}|\mathbf{f}\), i.e. treating \(\mathbf{f}\) as known.

suppressMessages(library(jpeg))
img <- readJPEG("/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/3/Problem5_1.jpg")
grid.raster(img)

Consider a new set of inputs \(\mathbf{X}_*=(x_{*1}, \dots,x_{*m})^\top\) for which we want to infer the unobserved random vector \(\mathbf{f}\left(\mathbf{X}_*\right)=\left(f(x_{*1}), \dots,f(x_{*m})\right)^\top\), with shorthand notation \(\mathbf{f}_*\). Intuitively, it seem like a good idea to use what we have observed, i.e. the training data \(\mathbf{y}=(y_1, \dots,y_n)^\top\) (observed at the inputs \(\mathbf{X}=(x_1, \dots,x_n)^\top\)). This is exactly what the Gaussian process posterior does: It derives the distribution of the unknown (unobserved) \(\mathbf{f}_*\) conditional on the observed \(\mathbf{y}\), which we now describe.

From the Gaussian process definition, and using \(\mathbf{y}=\mathbf{f}+\boldsymbol{\varepsilon}\), we can write the joint distribution \(\mathbf{y}\) and \(\mathbf{f}_*\) as a multivariate normal

\[\begin{align*} \left(\begin{array}{c} \mathbf{y}\\ \mathbf{f}_{*} \end{array}\right) \sim \mathcal{N}\left(\left[\begin{array}{c} \mathbf{m}(\mathbf{X})\\ \mathbf{m}(\mathbf{X}_*) \end{array}\right],\left[\begin{array}{cc} \mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_{\varepsilon}^{2}\boldsymbol{I}_{n} & \mathbf{K}(\mathbf{X},\mathbf{X}_{*})\\ \mathbf{K}(\mathbf{X}_{*},\mathbf{X}) & \mathbf{K}(\mathbf{X}_{*},\mathbf{X}_{*}) \end{array}\right]\right), \end{align*}\]

where \(\boldsymbol{I}_{n}\) denotes the \(n\times n\) identity matrix. The conditional distribution \(\mathbf{f}_*\) given \(\mathbf{y}\) is also multivariate normal,

\[\begin{align*} \mathbf{f}_{*}|\mathbf{y} & \sim \mathcal{N}\left(\bar{\mathbf{f}}_{*},\mathrm{Cov}(\mathbf{f}_{*})\right)\\ \bar{\mathbf{f}}_{*} & = \mathbf{m}(\mathbf{X}_*) + \mathbf{K}(\mathbf{X}_{*}, \mathbf{X})\left(\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_{\varepsilon}^{2}\boldsymbol{I}_{n}\right)^{-1}\left(\mathbf{y} - \mathbf{m}(\mathbf{X}) \right)\\ \mathrm{Cov}(\mathbf{f}_{*}) & = \mathbf{K}(\mathbf{X}_{*},\mathbf{X_{*}})-\mathbf{K}(\mathbf{X}_{*},\mathbf{X})\left(\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_{\varepsilon}^{2}\boldsymbol{I}_{n}\right)^{-1}\mathbf{K}(\mathbf{X},\mathbf{X_{*}}). \end{align*}\]

We can use the above equations to smooth the training data by letting \(\mathbf{X_{*}}=\mathbf{X}\), i.e. we estimate the process \(\mathbf{f}=\mathbf{f}\left(\mathbf{X}\right)\) at the same inputs as the training data. This is achieved through the conditional distribution \(\mathbf{f}|\mathbf{y},\) which is a special case of the the conditional distribution derived above. Let us illustrate this using the dataset penguins.RData that can be downloaded from the Canvas page of the course. The data contain the dive heart rate (DHR, in beats per minute) during a dive and the corresponding duration of the dive (in minutes) for 125 penguins. The task is to predict the dive heart rate given a duration. The following code smooths the data using a Gaussian process equipped with a squared exponential kernel with parameters \(\sigma_f^2=10000\) and \(\ell=0.6\), and assuming the noise variance \(\sigma^2_{\varepsilon}=150\). The inputs are scaled to the unit interval, and the prior mean of the Gaussian process is assumed to be zero.

load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/3/penguins.RData')
y <- penguins$dive_heart_rate
n <- length(y)
X <- penguins$duration/max(penguins$duration) # Scale duration [0, 1]
plot(X, y, main="DHR vs scaled duration", col = "cornflowerblue", xlab = "Scaled duration", ylab = "Dive heart rate (DHR)")
sigma_f <- 100
ell <- 0.6
sigma_eps <- sqrt(150)
X_star <- X # For smoothing
# Compute means and kernel matrices (smoothing case can reuse computations)
# Prior means
m_X <- rep(0, n)
m_Xstar <- m_X
# Prior covariances
K_X_X <- kernel_matrix_squared_exp(X, X, sigma_f, ell)
K_X_Xstar <- K_X_X
K_Xstar_X <- K_X_Xstar
K_Xstar_Xstar <- K_X_X
# Conditional distribution of f given y is normal. 
fbar_star <- m_Xstar + K_Xstar_X%*%solve(K_X_X + sigma_eps^2*diag(n), y - m_X)
lines(X, fbar_star, col = "lightcoral", type = "p")
legend(x = "topright", pch = c(1, 1), col = c("cornflowerblue", "lightcoral"), legend=c("Data", "Smoothed (fitted) values"))

💪 Problem 5.2

Predict the Gaussian process on a fine grid, x_grid<-seq(0,1,length.out=1000). In the same figure, plot a scatter of the data, the posterior mean of the Gaussian process, and \(95\%\) probability intervals for the Gaussian process. Explain why your interval does not seem to capture \(95\%\) of the data.

Tip

In the smoothing we did above, \(\mathbf{X}_*=\mathbf{X}\). This is not the case here, which has several implications when using the code above.

X_grid <- seq(0, 1, length.out = 1000)

# Compute kernel matrices involving the fine grid
K_X_grid_X <- kernel_matrix_squared_exp(X_grid, X, sigma_f, ell)
K_X_grid_X_grid <- kernel_matrix_squared_exp(X_grid, X_grid, sigma_f, ell)

# Posterior mean and covariance
fbar_star <- K_X_grid_X %*% solve(K_X_X + sigma_eps^2 * diag(n), y)
cov_f_star <- K_X_grid_X_grid - K_X_grid_X %*% solve(K_X_X + sigma_eps^2 * diag(n), t(K_X_grid_X))

# Posterior standard deviations
std_dev <- sqrt(diag(cov_f_star))

plot(X, y, main="Gaussian Process Prediction", col="cornflowerblue", xlab="Scaled duration", ylab="Dive heart rate")

lines(X_grid, fbar_star, col="lightcoral", lwd=2)
lines(X_grid, fbar_star + 1.96 * std_dev, col="grey", lty=2)
lines(X_grid, fbar_star - 1.96 * std_dev, col="grey", lty=2)

legend("topright", legend=c("Data", "Posterior Mean", "95% Interval"),
       col=c("cornflowerblue", "lightcoral", "grey"), lty=c(NA, 1, 2), pch=c(1, NA, NA))

The 95% interval does not seem to capture 95% of the data because the intervals reflect uncertainty about the underlying function f(x), not the noise in the observations. That's why the interval might seem narrower than expected to cover 95% of the noisy data.

In the example above the kernel parameters \(\ell, \sigma_f\) and the noise variance \(\sigma_\varepsilon\) are given. In practice, these have to be estimated from the data. Let \(\boldsymbol{\theta}\) be the collection of kernel parameters and the noise variance (\(\boldsymbol{\theta}=(\ell,\sigma_f,\sigma_\varepsilon)\) for the squared exponential kernel). In Gaussian processes, one approach to estimate \(\boldsymbol{\theta}\) is via the marginal likelihood. The marginal likelihood is the density of the data \(\mathbf{y}\) after integrating out the Gaussian process \(\mathbf{f}\), viewed as a function of \(\boldsymbol{\theta}\) (for fixed \(\mathbf{y}\)). Mathematically, this can be derived as \[p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}, \mathbf{f}|\boldsymbol{\theta})d\mathbf{f}=\int p(\mathbf{y}| \mathbf{f},\boldsymbol{\theta})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f},\]

where \(p(\mathbf{f}|\boldsymbol{\theta})\) is the Gaussian process prior. It is possible to solve this integral analytically, however, a much simpler solution is the following. Since \(\mathbf{y}=\mathbf{f}+\boldsymbol{\varepsilon}\), and both \(\mathbf{f}\) and \(\boldsymbol{\varepsilon}\) are multivariate normal, it follows that \(\mathbf{y}\) is multivariate normal. Together with Problem 5.1, we deduce that \[\mathbf{y}|\boldsymbol{\theta} \sim \mathcal{N}\left(\mathbf{m}(\mathbf{X}), \mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_{\varepsilon}^2\boldsymbol{I}_{n}\right).\]
Note that in the expression above, \(\mathbf{K}(\mathbf{X},\mathbf{X})\) depends on the kernel parameters. The task is now to find the \(\boldsymbol{\theta}\) that maximises the marginal likelihood. As always, the optimisation is carried out in the logarithmic scale for numerical stability.

💪 Problem 5.3

For simplicity, assume that the only unknown parameter is the length scale \(\ell\). Use the optim() function to maximise the log of the marginal likelihood to find the maximum likelihood estimate of \(\ell\). Treat \(\sigma_f\) and \(\sigma_\varepsilon\) as known (fixed at \(\sigma_f=100\) and \(\sigma_\varepsilon=\sqrt{150}\)).

Tip

Write a function that computes the marginal likelihood in the logarithmic scale with the first argument being the length scale (the optim() function optimises over the first argument).

To avoid over- or underflow, always take the logarithm of your expressions by hand before coding them. For example, evaluating log(exp(-1000) in R gives -Inf (negative infinity), but clearly the answer should be -1000.

The optim() function was used for minimising a function in Computer lab 2, but keep in mind that we are maximising in this problem. Moreover, the optim() function will optimise over the length scale, which is a positive quantity by definition. A standard call to optim(), however, will not account for this restriction. We note that the log of the marginal likelihood function in our example will have two modes, since the solution \(\ell_\mathrm{opt}\) will have an equally large function value at \(-\ell_\mathrm{opt}\); \(\log p(\mathbf{y}|\ell)=\log p(\mathbf{y}|-\ell),\forall \ell\), because of the square \(\ell^2\). Thus, if we get a negative solution, we just take its absolute value. More elaborate approaches are (i) impose restrictions in the optimisation or (ii) re-parameterise the model, but this is not required here (see the note Extra below for extra details if interested, but no need to implement).

Extra: Optimisation under restrictions of the parameter space

There are ways to incorporate restrictions on the parameters in the optimisation using the arguments lower and upper in the optim().

Another solution is to re-parameterise the model, for example in terms of \(\phi=\log(\ell^2)\). Now \(\log p(\mathbf{y}|\phi)\neq\log p(\mathbf{y}|-\phi)\), and we can optimise to obtain \(\phi_\mathrm{opt}\). The final estimate is \(\ell_\mathrm{opt}=\sqrt{\exp(\phi_\mathrm{opt})}\).

log_marginal_likelihood <- function(ell) {
  if (ell <= 0) return(-Inf)
  K_X_X <- kernel_matrix_squared_exp(X, X, sigma_f, ell)
  K <- K_X_X + sigma_eps^2 * diag(n)
  L <- chol(K) # Cholesky decomposition
  alpha <- solve(t(L), solve(L, y)) 
  log_det_K <- 2 * sum(log(diag(L)))

  log_marginal_likelihood <- -0.5 * t(y) %*% alpha - 0.5 * log_det_K - n / 2 * log(2 * pi)
  return(as.numeric(log_marginal_likelihood))
}

# Optimization
optim_result <- optim(par = 0.1, fn = function(ell) log_marginal_likelihood(ell), method = "L-BFGS-B", lower = 1e-5, upper = Inf)

ell_opt <- optim_result$par
if (ell_opt < 0) ell_opt <- abs(ell_opt)

cat("Estimated length scale (ell):", ell_opt, "\n")
Estimated length scale (ell): 0.0862343 

💪 Problem 5.4

Another approach (that does not use the marginal likelihood) to estimate \(\boldsymbol{\theta}\) is via cross-validation. Assume again that the only unknown parameter is \(\ell\). Use \(K=5\) fold cross-validation to estimate \(\ell\).

Tip

Compute a cross-validated estimate of a suitable error function (e.g. the root mean squared error of the test data) on a grid of \(\ell\), for example, ell_grid<-seq(0,1,length.out=50). The estimate of \(\ell\) is the value on the grid that minimises the error function.

Note that there are 125 penguins so, with \(K=5\), each fold has a training set and test set of, respectively \(n=100\) and \(m=25\). The penguins are randomly sorted, so no need to reshuffle the data to perform the cross-validation.

# Define the grid for ell
ell_grid <- seq(0.01, 1, length.out = 50)
K <- 5
folds <- cut(1:n, breaks = K, labels = FALSE)

avg_rmse <- numeric(length(ell_grid))

# Cross-validation
for (i in seq_along(ell_grid)) {
  ell <- ell_grid[i]
  fold_rmse <- numeric(K)
  
  for (k in 1:K) {
    test_indices <- which(folds == k)
    train_indices <- setdiff(1:n, test_indices)
    
    X_train <- X[train_indices]
    y_train <- y[train_indices]
    X_test <- X[test_indices]
    y_test <- y[test_indices]
    
    # Compute kernel matrices
    K_train <- kernel_matrix_squared_exp(X_train, X_train, sigma_f = 100, ell = ell)
    K_test_train <- kernel_matrix_squared_exp(X_test, X_train, sigma_f = 100, ell = ell)
    K_train <- K_train + diag(sigma_eps^2, length(train_indices)) 
    
    # Compute predictions
    K_train_inv <- solve(K_train)
    f_bar <- kernel_matrix_squared_exp(X_test, X_train, sigma_f = 100, ell = ell) %*% K_train_inv %*% (y_train)
    
    fold_rmse[k] <- sqrt(mean((y_test - f_bar)^2))
  }
  
  avg_rmse[i] <- mean(fold_rmse)
}

optimal_index <- which.min(avg_rmse)
optimal_ell <- ell_grid[optimal_index]

plot(ell_grid, avg_rmse, type = 'b', xlab = 'Length Scale (ell)', ylab = 'Average RMSE', main = 'Cross-Validated RMSE vs Length Scale')

abline(v = optimal_ell, col = 'red', lty = 2)
text(optimal_ell, min(avg_rmse), labels = paste("Optimal ell:", round(optimal_ell, 2)), pos = 4)

cat("Optimal length scale (ell):", optimal_ell, "\n")
Optimal length scale (ell): 0.252449 

One drawback with the cross-validation compared to the marginal likelihood approach is that it is evaluated on grid. The computation with grid-based methods does not scale well. For example, if we use a grid-based method to estimate the full \(\boldsymbol{\theta}\) with the settings above (three unknowns), it requires \(50^3=125000\) evaluations. Moreover, the grid point that gives the maximum does not coincide with the exact solution due to the low resolution of the grid. The marginal likelihood approach does not have these drawbacks.

💪 Problem 5.5

Assume now the realistic situation that the full \(\boldsymbol{\theta}\) is unknown, i.e. all parameters \(\sigma_f,\ell,\sigma_\varepsilon\). Estimate them by maximising the log of the marginal likelihood using the optim() function (no cross-validation!). Do your estimates coincides with the values I gave you, i.e. \(\sigma_f=100,\ell=0.6,\sigma_\varepsilon=\sqrt{150}\)?

# Define the new function for theta
kernel_matrix_squared_exp <- function(X, Xstar, sigma_f, ell) {
  pairwise_squared_distances <- outer(X, Xstar, FUN = "-")^2
  kernel_matrix <- sigma_f^2 * exp(-1 / (2 * ell^2) * pairwise_squared_distances)
  return(kernel_matrix)
}

# Define the new function for theta
log_marginal_likelihood_full <- function(theta, X, y) {
  sigma_f <- exp(theta[1])
  ell <- exp(theta[2])
  sigma_eps <- exp(theta[3])

  K <- kernel_matrix_squared_exp(X, X, sigma_f, ell) + sigma_eps^2 * diag(length(y))
  L <- chol(K) # Cholesky decomposition
  alpha <- solve(L, solve(t(L), y))
  log_det_K <- 2 * sum(log(diag(L)))
  log_marginal_likelihood <- -0.5 * t(y) %*% alpha - 0.5 * log_det_K - 0.5 * length(y) * log(2 * pi)

  return(-log_marginal_likelihood)
}

# Initial guess
theta <- log(c(100, 0.6, sqrt(150)))

optim_result <- optim(par = theta, fn = log_marginal_likelihood_full, X = X, y = y, method = "L-BFGS-B", control = list(maxit = 500))

sigma_f_opt <- exp(optim_result$par[1])
ell_opt <- exp(optim_result$par[2])
sigma_eps_opt <- exp(optim_result$par[3])

cat("Optimized sigma_f:", sigma_f_opt, "\n")
Optimized sigma_f: 106.1764 
cat("Optimized ell:", ell_opt, "\n")
Optimized ell: 0.6150152 
cat("Optimized sigma_eps:", sigma_eps_opt, "\n")
Optimized sigma_eps: 12.22259 
It seems that we have optimized parameters quite close to the given ones (initial guess).

Footnotes

  1. The original data come from this source. The dataset used here includes a subset of the 57 original features.↩︎

  2. Lindholm, A., Wahlström, N., Lindsten, F. and Schön, T. B (2022). Machine Learning - A First Course for Engineers and Scientists. Cambridge University Press.↩︎

  3. The original data come from this source.↩︎